Intelligent Fusion of Evidence from Multiple Sources for Text Classification
Abstract
Automatic text classification using current approaches is known to perform poorly when documents are noisy or when only a limited amount of textual content is available. Yet many users need access to such documents, which are found in large numbers in digital libraries and on the Web. If documents are not classified, they are difficult to find when browsing. Further, search precision suffers when categories cannot be checked, since many retrieved documents may fail to meet category constraints. In this work, we study how different types of evidence from multiple sources can be intelligently fused to improve classification of text documents into predefined categories. We present a classification framework based on an inductive learning method, Genetic Programming (GP), to fuse evidence from multiple sources. We show that good classification is possible for documents that are noisy or that have small amounts of text (e.g., short metadata records), provided that multiple sources of evidence are fused in an intelligent way. The framework is validated through experiments performed on documents in two testbeds. One is the ACM Digital Library (using a subset available in connection with CITIDEL, part of NSF’s National Science Digital Library). The other is Web data, in particular the portion associated with the Cadê Web directory. Our studies have shown that improvement can be achieved relative to other machine learning approaches if genetic programming methods are combined with classifiers such as kNN. Extensive analysis was performed to study the results generated through the GP-based fusion approach and to understand key factors that promote good classification.
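The idea of fusing evidence before a kNN vote can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two evidence types ("content" and "citation"), the toy scores, the function names, and the hand-written fusion expression are hypothetical and do not come from the paper. In a GP-based framework the expression would be discovered by evolutionary search over expression trees; that search is not shown here.

```python
from collections import defaultdict


def content_only(evidence):
    """Baseline: rank neighbors by a single (hypothetical) text-similarity score."""
    return evidence["content"]


def fused_similarity(evidence):
    """One hand-written candidate fusion expression (assumed, not from the paper).

    A GP individual could encode an expression tree of this kind.
    """
    content, citation = evidence["content"], evidence["citation"]
    return content + citation + content * citation


def knn_classify(neighbors, similarity, k=3):
    """Pick the label with the largest similarity-weighted vote among the top k."""
    top_k = sorted(neighbors, key=lambda n: similarity(n["evidence"]), reverse=True)[:k]
    votes = defaultdict(float)
    for n in top_k:
        votes[n["label"]] += similarity(n["evidence"])
    return max(votes, key=votes.get)


# Toy training documents with made-up evidence scores relative to one query
# document whose text is short and noisy, so content similarity is unreliable.
neighbors = [
    {"label": "databases", "evidence": {"content": 0.10, "citation": 0.90}},
    {"label": "databases", "evidence": {"content": 0.20, "citation": 0.80}},
    {"label": "graphics", "evidence": {"content": 0.30, "citation": 0.05}},
    {"label": "graphics", "evidence": {"content": 0.25, "citation": 0.00}},
]

print(knn_classify(neighbors, content_only))      # content evidence alone
print(knn_classify(neighbors, fused_similarity))  # both sources fused
```

In this toy data the content-only ranking favors the "graphics" neighbors, while the fused score lets the second evidence source outweigh the weak text signal and selects "databases". The example only shows the mechanics of plugging a fused similarity into kNN; it says nothing about the accuracy reported in the work.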
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Sensor fusion
Sensor fusion is a method of integrating signals from multiple sources. It allows information from several different sources to be extracted and integrated into a single signal or piece of information. In many cases the sources of information are sensors or other devices that allow for perception or measurement of a changing environment. Information received from multiple sensors is processed using "sensor fusion...
Neural Net Learning Issues in Classification of Free Text Documents
In intelligent analysis of large amounts of text, no single clue reliably indicates that a pattern of interest has been found. When using multiple clues, it is not known how these should be integrated into a decision. In the context of this investigation, we have been using neural nets as parameterized mappings that allow for fusion of higher-level clues extracted from free text. By using ...
Using Wavelet Support Vector Machine for Fault Diagnosis of Gearboxes
Identifying fault categories, especially for compound faults, is a challenging task in mechanical fault diagnosis. For this task, this paper proposes a novel intelligent method based on wavelet packet transform (WPT) and multiple classifier fusion. Unexpected damage to the gearbox may bring down the whole transmission line. It is therefore crucial for engineers and researchers to monitor the...
Situation and Threat Refinement Approach for Combating the Asymmetric Threat
In order to combat the present and future asymmetric threats to national and international security, information fusion developments must progress beyond current Level 1 (Object Refinement) paradigms. By focusing on the challenges of Continuous Intelligence Preparation of the Battlespace (CIPB), Lockheed Martin Advanced Technology Laboratories (ATL) has begun to elicit an infrastructure and ena...
Publication date: 2006